# COMPSCI 389: Introduction to Machine Learning
# Topic 10.3 Automatic Differentiation for ML

In this notebook we show how automatic differentiation can be used for ML by running gradient descent on the sample MSE for a linear parametric model fit to the GPA data set.

First, here are the import statements we will use. We will use a train-test split and standardization. We will also use `shuffle` from `sklearn.utils` to shuffle data points.

In [1]:
import autograd.numpy as np
from autograd import grad
import pandas as pd
from sklearn.model_selection import train_test_split 
from sklearn.preprocessing import StandardScaler
from sklearn.utils import shuffle

Next, let's load the GPA data set and split it into inputs `X` and labels `y`. Unlike before, we will convert these to NumPy `ndarray` objects (which the `autograd.numpy` functions accept) by calling `.values` after `.iloc[...]`. Remember that you can load the GPA data set directly from online (the upper line), or from a local download (the commented out lower line).

In [2]:
df = pd.read_csv("https://people.cs.umass.edu/~pthomas/courses/COMPSCI_389_Spring2024/GPA.csv", delimiter=',') # Read GPA.csv, assuming numbers are separated by commas
# df = pd.read_csv("data/GPA.csv", delimiter=',')

# Split into features and labels
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values.reshape(-1, 1) # Reshape for vectorized operations

Next, let's split into training and testing sets, using $80\%$ of the data for training and $20\%$ for testing.

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Next, let's use `StandardScaler` to pre-process the inputs. Notice that we use `fit_transform` on the training data to find the necessary values (in this case the mean and standard deviation). At test time, we then want to use these same values, since our model was trained under the assumption that the data would be pre-processed using these values. Hence, we use `transform` rather than `fit_transform` when pre-processing the testing data.

While it can be reasonable to run `fit_transform` once on all of the data, this would result in the model being (partially) computed from the testing data. The approach used below of calling `fit_transform` on just the training data ensures that the testing data does not influence the learned model in any way.

In [4]:
# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train) # This computes the mean and standard deviation from the training data (without looking at the testing data)
X_test = scaler.transform(X_test) # This uses the mean and standard deviation computed from the training data! (transform, not fit_transform)
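
To make the distinction between `fit_transform` and `transform` concrete, here is a minimal sketch of the same idea written with plain NumPy. The toy arrays and variable names are hypothetical stand-ins, not the GPA data; the point is only that the mean and standard deviation come from the training rows and are then reused for the test rows.

```python
import numpy as np

# Hypothetical toy data standing in for the GPA features (not the real data set).
X_train_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
X_test_toy = np.array([[4.0, 30.0]])

# "fit": compute per-column statistics from the training data only.
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses

# "transform": apply the training statistics to both sets.
X_train_std = (X_train_toy - mean) / std
X_test_std = (X_test_toy - mean) / std

print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
print(X_test_std)
```

The standardized training columns have mean zero and unit standard deviation by construction; the test rows generally will not, because they are scaled with the training statistics.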

Next, since we won't be using a basis, let's append a column of ones to the `X_train` and `X_test` numpy arrays:

In [5]:
X_train = np.c_[np.ones(X_train.shape[0]), X_train]
X_test = np.c_[np.ones(X_test.shape[0]), X_test]

Next, let's define our linear parametric model. This implements:

$$
f_w(x_i) = \sum_{j=1}^d w_j x_{i,j}.
$$

Or, using dot-product notation,

$$
f_w(x_i) = w \cdot x_i.
$$

For efficiency, our code will use this dot-product approach. Inside of numpy, this dot product will be computed using a loop over weights, but this loop will be implemented in a more efficient language (likely C or C++) rather than Python.

In [6]:
def linear_model(X, weights):
    return np.dot(X, weights)
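
As a rough illustration of what the dot product replaces, the sketch below writes the sum $\sum_j w_j x_{i,j}$ as an explicit Python loop and checks that it matches `np.dot`. The helper `linear_model_loop` and the random demo arrays are hypothetical and exist only for this comparison.

```python
import numpy as np

def linear_model_loop(X, weights):
    # Explicit double loop: prediction i is sum_j weights[j] * X[i, j].
    n, d = X.shape
    predictions = np.zeros((n, 1))
    for i in range(n):
        for j in range(d):
            predictions[i, 0] += weights[j, 0] * X[i, j]
    return predictions

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5, 3))   # hypothetical inputs: 5 points, 3 features
w_demo = rng.normal(size=(3, 1))   # hypothetical weight column vector

vectorized = np.dot(X_demo, w_demo)          # the dot-product form
looped = linear_model_loop(X_demo, w_demo)   # the explicit-sum form
print(np.allclose(vectorized, looped))
```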

Next, let's implement our loss function, which computes the sample MSE.

In [7]:
def loss_function(weights, X, y):
    predictions = linear_model(X, weights)
    return np.mean((predictions - y)**2)
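
One detail worth checking in a loss like this is array shape. The predictions have shape `(n, 1)`, so the labels need the same shape, which is why `y` was reshaped with `.reshape(-1, 1)` when the data was loaded. The sketch below, using small hypothetical arrays, shows what NumPy broadcasting does if the labels are left one-dimensional.

```python
import numpy as np

predictions = np.array([[1.0], [2.0], [3.0]])  # shape (3, 1), like np.dot(X, weights)
y_flat = np.array([1.0, 2.0, 3.0])             # shape (3,), labels without the reshape
y_col = y_flat.reshape(-1, 1)                  # shape (3, 1), labels with the reshape

bad_diff = predictions - y_flat   # broadcasts to (3, 3): every prediction minus every label
good_diff = predictions - y_col   # shape (3, 1): one residual per data point

bad_mse = np.mean(bad_diff**2)
good_mse = np.mean(good_diff**2)
print(bad_diff.shape, good_diff.shape)
print(bad_mse, good_mse)
```

Here the predictions equal the labels exactly, so the correct MSE is zero, yet the mis-shaped version reports a nonzero value without raising any error.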

Next, let's use autograd to get the gradient of the loss function. We want the gradient with respect to `weights`, not `X` or `y`. Remember that the `grad` function defaults to providing the derivative with respect to the first input, which is the desired behavior here.

In [8]:
grad_loss = grad(loss_function) # Defaults to grad(loss_function, 0)
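
For this particular loss the gradient also has a closed form, $\nabla_w \text{MSE} = \frac{2}{n} X^\top (Xw - y)$, so the function that autograd returns should agree with it up to floating-point error. The sketch below does not call autograd; it only checks the closed form against central finite differences using plain NumPy. The function names and random demo data are hypothetical.

```python
import numpy as np

def mse(weights, X, y):
    # Sample MSE of a linear model, mirroring the loss defined above.
    return np.mean((np.dot(X, weights) - y) ** 2)

def analytic_grad(weights, X, y):
    # Closed-form gradient of the sample MSE: (2/n) X^T (Xw - y).
    n = X.shape[0]
    return (2.0 / n) * np.dot(X.T, np.dot(X, weights) - y)

def finite_difference_grad(weights, X, y, eps=1e-6):
    # Central differences, perturbing one weight at a time.
    g = np.zeros_like(weights)
    for j in range(weights.shape[0]):
        step = np.zeros_like(weights)
        step[j, 0] = eps
        g[j, 0] = (mse(weights + step, X, y) - mse(weights - step, X, y)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 3))   # hypothetical inputs
y_demo = rng.normal(size=(20, 1))   # hypothetical labels
w_demo = rng.normal(size=(3, 1))    # hypothetical weights

g_exact = analytic_grad(w_demo, X_demo, y_demo)
g_approx = finite_difference_grad(w_demo, X_demo, y_demo)
print(np.allclose(g_exact, g_approx, atol=1e-5))
```

The same kind of comparison can be used as a sanity check on any gradient function, including one produced by automatic differentiation.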

Next, let's select the initial weight vector. Common strategies are to use all-zeros or to use random values (from some distribution, e.g., a normal distribution).

In [9]:
num_weights = X_train.shape[1]
#weights = np.zeros((num_weights, 1)) # Start with all weights being zero
weights = np.random.randn(num_weights, 1) # Sample (num_weights x 1) values from a standard normal distribution

Finally, let's run 50 iterations of gradient descent with a step size (learning rate) of $0.05$.

In [10]:
num_iterations = 50
learning_rate = 0.05

# Training loop
for iteration in range(num_iterations):
    weights -= learning_rate * grad_loss(weights, X_train, y_train)

    # Print loss every 10 iterations
    if iteration % 10 == 0:
        current_loss = loss_function(weights, X_train, y_train)
        print(f"Iteration {iteration}, Loss: {current_loss}")

# Evaluate on test data
test_loss = loss_function(weights, X_test, y_test)
print(f"Test MSE: {test_loss}")


Iteration 0, Loss: 5.260402217978024
Iteration 10, Loss: 1.2519262164232552
Iteration 20, Loss: 0.7295958047115758
Iteration 30, Loss: 0.627606681129744
Iteration 40, Loss: 0.5992668395272726
Test MSE: 0.5932749622703825


Remember when we worked out the derivatives necessary to do exactly this? That process was slow and error-prone. Using automatic differentiation techniques made this much easier. All you have to do is define your loss function and model, and autograd takes care of all of the derivatives for you!

# Epochs and Mini-Batches

When the data set is large, computing the sample MSE (or gradient of the sample MSE) for the entire training set can take a very long time.

**Idea**: Split the training data into **mini-batches**.
- Each mini-batch is a collection of several rows (training points).
- Each iteration of gradient descent can use a different mini-batch.
- The process of running gradient descent on all mini-batches one time is called an **epoch**.
 - Hence, each epoch corresponds to one pass over the entire data set, performing one gradient update for each mini-batch.
- Training typically involves running several epochs.
- Different splits of the data into mini-batches are typically used for each epoch.
- We typically define the size of each mini-batch, not the number of mini-batches.
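
To make the bookkeeping in the list above concrete, here is a small sketch of how a shuffled data set of a given size splits into mini-batches of a fixed size. The training-set size of 250 is a hypothetical value chosen so that the last mini-batch is smaller than the others.

```python
import math
import numpy as np

n_points = 250          # hypothetical training-set size
minibatch_size = 100

# Number of gradient updates per epoch: one per mini-batch.
num_batches = math.ceil(n_points / minibatch_size)

rng = np.random.default_rng(0)
order = rng.permutation(n_points)   # a fresh shuffle, as would be drawn each epoch

# Consecutive slices of the shuffled indices form the mini-batches.
batches = [order[i:i + minibatch_size] for i in range(0, n_points, minibatch_size)]
batch_sizes = [len(b) for b in batches]
print(num_batches, batch_sizes)
```

Every index appears in exactly one mini-batch, so one pass over `batches` is one epoch.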

Here is our code, updated to include mini-batches.

In [11]:
num_epochs = 50
learning_rate = 0.05
minibatch_size = 100

for epoch in range(num_epochs):
    # Shuffle the training data
    X_train_shuffled, y_train_shuffled = shuffle(X_train, y_train)

    # Loop over mini-batches
    for i in range(0, X_train.shape[0], minibatch_size):
        end = min(i + minibatch_size, X_train_shuffled.shape[0]) # The last mini-batch may be smaller than the others
        X_batch = X_train_shuffled[i:end]
        y_batch = y_train_shuffled[i:end]

        gradients = grad_loss(weights, X_batch, y_batch)
        weights -= learning_rate * gradients

    # Print loss every 10 epochs
    if epoch % 10 == 0:
        current_loss = loss_function(weights, X_train, y_train)
        print(f"Epoch {epoch}, Loss: {current_loss}")

# Evaluate on test data
test_loss = loss_function(weights, X_test, y_test)
print(f"Test MSE: {test_loss}")


Epoch 0, Loss: 0.5832311803176145
Epoch 10, Loss: 0.584201610532094
Epoch 20, Loss: 0.5832333795591383
Epoch 30, Loss: 0.5836298546523091
Epoch 40, Loss: 0.5840496234285119
Test MSE: 0.5892578927313206


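Notice that the sample MSE reached lower values in fewer epochs! Part of the explanation is that each epoch now performs one gradient update per mini-batch rather than a single update. Note also that the cell above continues from the weights learned by the previous cell rather than re-initializing them, so the two runs do not start from the same point. Still, this pattern is common: mini-batches not only make the amount of data used for each gradient computation more manageable, they often speed up the optimization process. The full reasoning for this is beyond the scope of the course, but is something we may discuss briefly in lecture.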
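
A minimal sketch of this effect, on hypothetical synthetic data rather than the GPA data set: both runs below start from all-zero weights and use the same number of epochs and the same step size, but the mini-batch run makes ten updates per epoch while the full-batch run makes one. The gradient is written in closed form with plain NumPy so that the example is self-contained; all names and constants are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 3
X_syn = rng.normal(size=(n, d))                          # synthetic inputs
true_w = np.array([[1.0], [2.0], [-1.0]])                # weights used to generate labels
y_syn = X_syn @ true_w + 0.1 * rng.normal(size=(n, 1))   # labels with a little noise

def mse(w, X, y):
    return np.mean((X @ w - y) ** 2)

def mse_grad(w, X, y):
    # Closed-form gradient of the sample MSE for a linear model.
    return (2.0 / X.shape[0]) * X.T @ (X @ w - y)

num_epochs, learning_rate, minibatch_size = 5, 0.05, 100

# Full-batch gradient descent: one update per epoch.
w_full = np.zeros((d, 1))
for epoch in range(num_epochs):
    w_full = w_full - learning_rate * mse_grad(w_full, X_syn, y_syn)

# Mini-batch gradient descent: one update per mini-batch (10 per epoch here).
w_mini = np.zeros((d, 1))
for epoch in range(num_epochs):
    order = rng.permutation(n)
    for i in range(0, n, minibatch_size):
        idx = order[i:i + minibatch_size]
        w_mini = w_mini - learning_rate * mse_grad(w_mini, X_syn[idx], y_syn[idx])

full_loss = mse(w_full, X_syn, y_syn)
mini_loss = mse(w_mini, X_syn, y_syn)
print(f"full-batch: {full_loss:.4f}, mini-batch: {mini_loss:.4f}")
```

This only illustrates the update-count effect on an easy problem; it is not a general claim about how the two methods compare.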